The Treegram Index - An Efficient Technique for Retrieval in Linguistic Treebanks

Under consideration for other conferences (specify)? ACL

Abstract

In computational linguistics, large tree databases tagged with morpho-syntactic information require fast retrieval of multiway tree structures. To tackle this problem, we present treegram indexing, a generalization of the classical n-gram indexing technique. As an application of treegram indexing, we describe the Venona retrieval system, which handles the BHt treebank containing 508,650 phrase structure trees.

1 Tree Retrieval

Multiway trees (MT, henceforth) play a central role in representing complex linguistic information because they are a common and well-understood data structure for describing hierarchical information. With the availability of large treebanks, retrieval techniques for highly structured data have become essential. One of the best-known linguistic tree repositories is the Penn Treebank of the University of Pennsylvania: its foundation is a corpus containing 4.5 million words of American English; half of this corpus has been annotated for skeletal syntactic structure, cf. (Marcus et al., 1993).
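The abstract describes treegram indexing only as a generalization of classical n-gram indexing, so the details of the actual index are not available here. The sketch below illustrates just the general idea under stated assumptions: trees are represented as (label, children) tuples, and the indexed unit is a single parent-child label pair, a deliberately simple stand-in rather than the authors' treegram definition. The function names edge_grams, build_index and candidates are hypothetical. As with string n-grams, the index only filters: every returned tree contains all of the query's label pairs, but a full structural match would still have to be verified afterwards.

```python
from collections import defaultdict

# Assumption: a tree is a tuple (label, [child_tree, ...]).
# Assumption: the indexed unit is one parent-child label pair,
# a simplified stand-in for a treegram, not the paper's definition.

def edge_grams(tree):
    """Yield a (parent_label, child_label) pair for every edge of the tree."""
    label, children = tree
    for child in children:
        yield (label, child[0])
        yield from edge_grams(child)

def build_index(trees):
    """Map each label pair to the set of ids of the trees that contain it."""
    index = defaultdict(set)
    for tree_id, tree in enumerate(trees):
        for gram in edge_grams(tree):
            index[gram].add(tree_id)
    return index

def candidates(index, query, n_trees):
    """Return ids of trees containing every label pair of the query.

    This is a filter step only: candidates still need a structural check.
    """
    grams = set(edge_grams(query))
    if not grams:
        # A single-node query has no edges, so nothing can be filtered out.
        return set(range(n_trees))
    postings = [index.get(gram, set()) for gram in grams]
    return set.intersection(*postings)
```

Intersecting posting sets keeps the filter cheap, since only the trees that survive it need the more expensive tree-matching step.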

Similar resources

The Treegram Index-An Efficient Technique for Retrieval in Linguistic Treebanks

Multiway trees (MT, henceforth) are a common and well-understood data structure for describing hierarchical linguistic information. With the availability of large treebanks, retrieval techniques for highly structured data now become essential. In this contribution, we investigate the efficient retrieval of MT structures at the cost of a complex index--the Treegram Index. We illustrate our appro...

Multiway-Tree Retrieval Based on Treegrams

Large tree databases as knowledge repositories become more and more important; a prominent example are the treebanks in computational linguistics: text corpora consisting of up to five million words tagged with syntactic information. Consequently, these large amounts of structured data pose the problem of fast tree retrieval: Given a database T of labeled multiway trees and a query tree q, find...

Efficient Parsing for Bilexical Context-free Grammars and Head Automaton Grammars

Several recent stochastic parsers use bilexical grammars, where each word type idiosyncratically prefers particular complements with particular head words. We present O(n^4) parsing algorithms for two bilexical formalisms, improving the previous upper bounds of O(n^5). Also, for a common special ...

Computing Translation Units and Quantifying Parallelism in Parallel Dependency Treebanks

The linguistic quality of a parallel treebank depends crucially on the parallelism between the source and target language annotations. We propose a linguistic notion of translation units and a quantitative measure of parallelism for parallel dependency treebanks, and demonstrate how the proposed translation units and parallelism measure can be used to compute transfer rules, spot annotation err...

Efficient Probabilistic Top-down and Left-corner Parsing

This paper examines efficient predictive broad-coverage parsing without dynamic programming. In contrast to bottom-up methods, top-down parsing produces partial parses that are fully connected trees spanning the entire left context, from which any kind of non-local dependency or partial semantic interpretation can in principle be read. We contrast top-down and left-corner parsing, and find both to ...

Publication date: 1999
